Multimedia Macros for Portable Optimized Programs

Juan  Carlos  Rojas  and  Miriam  Leeser 
Department  of  Electrical  and  Computer  Engineering 
Northeastern  University,  Boston 

Abstract 

Multimedia architectures can speed up applications significantly when programmed manually. Until now, optimized programs
have not been portable, because of differences in instruction sets, register lengths, alignment
requirements and programming styles. We solve all these problems by using a library of C pre-processor macros
called MMM. We implemented three examples from video compression in MMM, and automatically translated
them into optimized code for four distinct multimedia processors. Their performance is comparable to, and in several
cases better than, that of equivalent examples optimized by the processor vendors.

Problem 

Multimedia computing has been one of the greatest challenges in computer engineering for the last decade. Computer designers have
had to come up with solutions capable of processing the enormous amounts of data required by multimedia applications.
The solutions came in the form of multimedia processors and multimedia extensions to general-purpose processors. Some
well-known examples are AltiVec, MMX and its successors SSE and SSE2, and the TriMedia processors.

All multimedia architectures follow the same basic approach: they partition the registers into sections that represent multiple data
elements, and operate on all the sections in parallel. In addition, complex instructions have been added to speed up specific tasks
found in multimedia applications. For example, some architectures include a Sum of Absolute Differences instruction, or a
Multiply-High instruction (multiply and pack the most-significant part of the product).
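The two ideas in this paragraph, partitioned registers and complex instructions, can be illustrated in plain C. The sketch below is not taken from MMM: the function names are hypothetical, a 32-bit word stands in for a partitioned register, and the Multiply-High variant shown is the unsigned 16-bit one.

```c
#include <assert.h>
#include <stdint.h>

/* Partitioned add: treats a 32-bit word as four unsigned 8-bit sections and
 * adds them section by section with wrap-around. The low 7 bits of each
 * section are added first so that no carry crosses a section boundary; the
 * top bit of each section is then patched in with XOR. */
uint32_t add_u8x4(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* Multiply-High on one unsigned 16-bit section: form the full 32-bit
 * product and keep only its most-significant 16 bits. */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;
    return (uint16_t)(product >> 16);
}

/* Usage example: each assert documents one expected result. */
void demo_partitioned_ops(void)
{
    /* Sections FF+01, 01+01, FE+03, 02+03 give 00, 02, 01, 05. */
    assert(add_u8x4(0xFF01FE02u, 0x01010303u) == 0x00020105u);
    /* 0x8000 * 0x8000 = 0x40000000, whose upper half is 0x4000. */
    assert(mulhi_u16(0x8000u, 0x8000u) == 0x4000u);
}
```

A real multimedia instruction performs the same work on every section of a much wider register in a single operation, which is where the speedup comes from.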

Our experiments and other published results show that multimedia architectures can speed up applications by factors of up to 15, but
manual optimization is required in order to take full advantage of the complex instructions available. Manual optimization is very
time consuming, and up to now has resulted in non-portable programs. This is in part because different multimedia architectures have
different register lengths, programming styles and alignment requirements, and support different partitioned
instructions.

Solution:  MMM 

We solved the problem by creating MMM: a library of target-independent C pre-processor macros that implements a common set of
parallel operations, each of which is either available natively or efficiently emulated on every architecture in a given set of targets.
MMM provides a single interface to architectures with different register lengths and instruction sets. Long data vectors are
simulated by using several short vectors, and operations on long vectors are emulated as a sequence of operations on short vectors.
Similarly, vector operations that are missing on a given target are emulated using a sequence of simple vector operations, when it is
efficient to do so. The same concept can be used to resolve different alignment requirements. Some architectures require that vector
loads and stores be done at aligned addresses. If an unaligned load is required, one must load two aligned vectors and compose the
desired result from them. All this can be encapsulated inside MMM load macros, which thus provide the programmer with a general
unaligned-load virtual instruction.
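The composition of an unaligned load from two aligned loads can be sketched in portable C. This is only an illustration of the idea, not the MMM load macro: the function names are hypothetical, and a 32-bit word stands in for a vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host and 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Emulated unaligned 32-bit load. `words` is an aligned array and
 * `byte_offset` is the byte position of the wanted value inside it.
 * Two neighbouring aligned words are read and spliced together. */
uint32_t load_u_u32(const uint32_t *words, size_t byte_offset)
{
    const uint32_t *base = words + byte_offset / 4;
    unsigned shift = (unsigned)(byte_offset % 4) * 8;
    uint32_t first = base[0];
    uint32_t second;

    if (shift == 0)
        return first;               /* already aligned: one load suffices */
    second = base[1];
    if (host_is_little_endian())
        return (first >> shift) | (second << (32 - shift));
    return (first << shift) | (second >> (32 - shift));
}

/* Usage example: every offset must give the same value as a byte copy. */
void demo_unaligned_load(void)
{
    const uint8_t bytes[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    uint32_t words[3];
    uint32_t expected;
    size_t offset;

    memcpy(words, bytes, sizeof bytes);
    for (offset = 0; offset <= 8; offset++) {
        memcpy(&expected, bytes + offset, 4);
        assert(load_u_u32(words, offset) == expected);
    }
}
```

On a target with wider registers the same splice would be done with that target's shift or permute instructions; hiding the choice inside the load macro is what keeps the application code portable.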

The listing below shows simplified MMM macro definitions, on different targets, of the Sum of Absolute Differences of two 128-bit
vectors partitioned into 8-bit sections. Two partial sums are returned in a vector.

SSE2 (128-bit registers):

    #define SAD_U8x16(a, b, c) \
        a = _mm_sad_epu8(b, c);

AltiVec (128-bit registers):

    #define SAD_U8x16(a, b, c) \
        a = vec_sum2s(vec_sum4s(vec_sub(vec_max(b, c), vec_min(b, c))));

MMX+SSE (64-bit registers):

    #define SAD_U8x16(a, b, c) \
        a##_0 = _m_psadbw(b##_0, c##_0); \
        a##_1 = _m_psadbw(b##_1, c##_1);

TriMedia (32-bit registers):

    #define SAD_U8x16(a, b, c) \
        a##_0 = UME8UU(b##_0, c##_0); \
        a##_0 += UME8UU(b##_1, c##_1); \
        a##_1 = UME8UU(b##_2, c##_2); \
        a##_1 += UME8UU(b##_3, c##_3);

Through  emulation,  MMM  implements  a  large  common  virtual  instruction  set  for  several  target  architectures.  By  using  MMM,  it  is 
possible  to  write  multimedia  applications  that  are  portable  among  different  multimedia  processors,  and  take  advantage  of  the  complex 
partitioned  operations  available  on  them. 
Programming Multimedia Architectures

■ Many different Multimedia Architectures
   - TriMedia®, AltiVec™, MMX™, SSE, SSE2
■ Different complex parallel instructions
■ Different register lengths, data types, ...
■ Goals for programming
   ■ Portability
      ■ Write application once
      ■ Translate to any architecture
   ■ High Performance
      ■ Make best use of each architecture

AltiVec  is  a  trademark  of  Motorola,  Inc.  TriMedia  and  the  TriMedia  logo  are  trademarks  of  TriMedia  Technologies,  Inc. 

The MMX logo, the Pentium® III logo and the Pentium® 4 logo are trademarks of Intel Corp.
Multimedia Architectures

■ Partitioned registers
■ Parallel operations
■ Complex instructions
■ Architectures differ in
   ■ Register lengths
   ■ Instruction sets
   ■ Alignment requirements
   ■ Programming styles
Example: Vector SAD

Vector SAD: Scalar Implementation

    uint8 *a, *b;
    int i, diff, sad;

    sad = 0;
    for (i = 0; i < 16; i++)
    {
        diff = a[i] - b[i];
        sad += diff > 0 ? diff : -diff;
    }
Vector SAD: Optimized for TriMedia®

    uint8 *a, *b;
    int A, B, sad;

    A = *((int *) a);
    B = *((int *) b);
    sad = UME8UU(A, B);
    A = *((int *)(a+4));
    B = *((int *)(b+4));
    sad += UME8UU(A, B);
    A = *((int *)(a+8));
    B = *((int *)(b+8));
    sad += UME8UU(A, B);
    A = *((int *)(a+12));
    B = *((int *)(b+12));
    sad += UME8UU(A, B);
Vector SAD: Optimized for SSE2

    uint8 *a, *b;
    __m128i A, B, C, D, E;
    int sad;

    A = _mm_load_si128((__m128i *) a);
    B = _mm_load_si128((__m128i *) b);
    C = _mm_sad_epu8(A, B);
    D = _mm_srli_si128(C, 8);
    E = _mm_add_epi32(C, D);
    sad = _mm_cvtsi128_si32(E);
Vector SAD: Optimized for AltiVec™

    uint8 *a, *b;
    vector uint8 A, B, C, D;
    vector uint8 E, F, G, H;
    int sad;

    A = vec_ld((vector uint8 *) a);
    B = vec_ld((vector uint8 *) b);
    C = vec_min(A, B);
    D = vec_max(A, B);
    E = vec_sub(D, C);
    F = vec_sum4s(E);
    G = vec_sums(F);
    H = vec_splat(G, 3);
    vec_ste(H, &sad);


Solution: MMM

■ MMM: MultiMedia Macros
■ Instruction-level macro library
■ Common virtual instruction set
■ Emulation of long registers
■ Emulation of instructions
Vector SAD: MMM Version

    uint8 *a, *b;
    DECLARE_U8x16(A)
    DECLARE_U8x16(B)
    DECLARE_U32x4(C)
    int sad;

    LOAD_A_U8x16(A, a)
    LOAD_A_U8x16(B, b)
    SAD2_U8x16(C, A, B)
    SUM2_U32x4(sad, C)
MMM definitions for SSE2

    #define DECLARE_U8x16(var) \
        __m128i var;

    #define LOAD_A_U8x16(var, ptr) \
        var = _mm_load_si128((__m128i *)(ptr));

    #define SAD2_U8x16(dst, src1, src2) \
        dst = _mm_sad_epu8(src1, src2);

    #define SUM2_U32x4(dst, src) \
        dst = _mm_cvtsi128_si32( \
            _mm_add_epi32(src, \
                _mm_srli_si128(src, 8)));
MMM definitions for AltiVec™

    #define DECLARE_U8x16(var) \
        vector UINT8 var;

    #define LOAD_A_U8x16(var, ptr) \
        var = vec_ld((vector UINT8 *)(ptr));

    #define SAD2_U8x16(dst, src1, src2) \
        dst = vec_sum2s(vec_sum4s( \
            vec_sub(vec_max(src1, src2), vec_min(src1, src2))));

    #define SUM2_U32x4(dst, src) \
        vec_ste(vec_splat( \
            vec_sums(src), 3), &dst);
MMM definitions for TriMedia®

    #define DECLARE_U8x16(var) \
        unsigned int var##_0; \
        unsigned int var##_1; \
        unsigned int var##_2; \
        unsigned int var##_3;

    #define LOAD_A_U8x16(var, ptr) \
        var##_0 = *((int *)(ptr)); \
        var##_1 = *(((int *)(ptr))+1); \
        var##_2 = *(((int *)(ptr))+2); \
        var##_3 = *(((int *)(ptr))+3);

    #define SAD2_U8x16(dst, src1, src2) \
        dst##_0 = UME8UU(src1##_0, src2##_0) + \
                  UME8UU(src1##_1, src2##_1); \
        dst##_2 = UME8UU(src1##_2, src2##_2) + \
                  UME8UU(src1##_3, src2##_3);

    #define SUM2_U32x4(dst, src) \
        dst = src##_0 + src##_2;
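To try the long-register emulation without TriMedia hardware, the 32-bit definitions can be approximated in portable C. The sketch below makes several assumptions that are not part of MMM: `ume8uu_emul` is a hypothetical software stand-in for the UME8UU operation, the loads use `memcpy` instead of pointer casts, and `DECLARE_U32x4` declares only the two sections that carry the partial sums.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software stand-in for UME8UU: the sum of absolute
 * differences of the four unsigned bytes packed in x and y. */
uint32_t ume8uu_emul(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    int lane;
    for (lane = 0; lane < 4; lane++) {
        uint32_t bx = (x >> (8 * lane)) & 0xFFu;
        uint32_t by = (y >> (8 * lane)) & 0xFFu;
        sum += bx > by ? bx - by : by - bx;
    }
    return sum;
}

/* A 128-bit vector is emulated by four 32-bit variables var_0..var_3. */
#define DECLARE_U8x16(var) uint32_t var##_0, var##_1, var##_2, var##_3;

/* Simplification of this sketch: only the two sections that hold the
 * partial sums are declared. */
#define DECLARE_U32x4(var) uint32_t var##_0, var##_2;

/* memcpy replaces the pointer casts so that the sketch has no alignment
 * or aliasing requirements. */
#define LOAD_A_U8x16(var, ptr)        \
    memcpy(&var##_0, (ptr) + 0, 4);   \
    memcpy(&var##_1, (ptr) + 4, 4);   \
    memcpy(&var##_2, (ptr) + 8, 4);   \
    memcpy(&var##_3, (ptr) + 12, 4);

#define SAD2_U8x16(dst, src1, src2)                    \
    dst##_0 = ume8uu_emul(src1##_0, src2##_0) +        \
              ume8uu_emul(src1##_1, src2##_1);         \
    dst##_2 = ume8uu_emul(src1##_2, src2##_2) +        \
              ume8uu_emul(src1##_3, src2##_3);

#define SUM2_U32x4(dst, src) dst = src##_0 + src##_2;

/* The portable MMM-style program: identical text for every target. */
uint32_t vector_sad(const uint8_t *a, const uint8_t *b)
{
    DECLARE_U8x16(A)
    DECLARE_U8x16(B)
    DECLARE_U32x4(C)
    uint32_t sad;

    LOAD_A_U8x16(A, a)
    LOAD_A_U8x16(B, b)
    SAD2_U8x16(C, A, B)
    SUM2_U32x4(sad, C)
    return sad;
}

/* Usage example: |i - (15 - i)| summed over i = 0..15 is 128. */
void demo_vector_sad(void)
{
    uint8_t a[16], b[16];
    int i;
    for (i = 0; i < 16; i++) {
        a[i] = (uint8_t)i;
        b[i] = (uint8_t)(15 - i);
    }
    assert(vector_sad(a, b) == 128u);
    assert(vector_sad(a, a) == 0u);
}
```

The body of `vector_sad` never mentions the register length; only the macro definitions know that one logical vector is four machine words.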
Other Approaches

■ Parallelizing compilers
■ Optimized kernel libraries
   ■ BLAS, Intel® IPP, VSIPL
■ Data-parallel languages
   ■ Fortran 90, SWARC, Vector Pascal
   ■ C++ SIMD classes
■ Automatic code generators
   ■ SPIRAL, FFTW, ATLAS
■ None of these approaches achieves the performance and flexibility of MMM
   ■ MMM makes use of complex instructions in each ISA
MMM Advantages

■ General solution
   ■ Complex applications
   ■ Complex partitioned instructions
■ Hand-coded performance
Our Approach

■ Study representative architectures:
   ■ TriMedia® TM1300 - 32-bit registers
   ■ MMX™+SSE - 64-bit integer registers
   ■ SSE2 - 128-bit integer registers
   ■ AltiVec™ - 128-bit registers
■ Define common virtual instruction set
■ Implement MMM libraries for each target
■ Implement portable applications in MMM
■ Measure performance, compare to reference implementations
Common Virtual Instruction Set

Vector types
   I8x16 U8x16 I16x8 U16x8
   I32x4 U32x4 F32x4

Vector declarations
   DECLARE DECLARE_CONST

Set
   SET SET1 CLEAR COPY

Load and store
   ■ Aligned
   ■ Unaligned

Rearrangement
   INTERLEAVE BROADCAST PERMUTE

Type conversion
   PACK EXTEND CVT
   ■ Integer <-> float
   ■ Vector <-> scalar



Common Virtual Instruction Set

Shift
   SLL SRL SRA ROL
   SLL_I SRL_I SRA_I ROL_I

Bit-wise logic
   AND ANDN OR XOR SEL

Comparison
   CMPEQ CMPLT CMPGT

Float arithmetic
   ADD SUB MULT MULTADD
   DIV MIN MAX SQRT

Integer arithmetic
   ADD SUB AVG MIN MAX
   MULTH MULTL MULTADDPAIRS
   SAD2 SUM2

Handling of overflow:
   ■ Modulo
   ■ Saturation
   ■ Unspecified
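The difference between the first two overflow modes can be shown on a single unsigned 8-bit section. The function names below are hypothetical and only illustrate the semantics; they are not the MMM macros themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Modulo arithmetic: the result wraps around, so 200 + 100 gives 44. */
uint8_t add_modulo_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating arithmetic: the result is clamped to the largest
 * representable value, so 200 + 100 gives 255. */
uint8_t add_saturate_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Section-wise addition of two emulated U8x16 vectors in either mode. */
void add_u8x16(uint8_t dst[16], const uint8_t x[16], const uint8_t y[16],
               int saturate)
{
    int lane;
    for (lane = 0; lane < 16; lane++)
        dst[lane] = saturate ? add_saturate_u8(x[lane], y[lane])
                             : add_modulo_u8(x[lane], y[lane]);
}

/* Usage example: the two modes differ only when a section overflows. */
void demo_overflow_modes(void)
{
    assert(add_modulo_u8(200, 100) == 44);
    assert(add_saturate_u8(200, 100) == 255);
    assert(add_modulo_u8(20, 30) == add_saturate_u8(20, 30));
}
```

One plausible reading of the third mode, "Unspecified", is that the program promises no section will overflow, which would leave each target free to pick whichever native instruction is cheapest.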


Example  Programs 

MPEG2  Video  Encoder 
16x16 Block L1-Distance

    DECLARE_U8x16(R1)
    DECLARE_U8x16(I)
    DECLARE_U32x4(Sad)
    UINT32 Sum;

    CLEAR_U32x4(Sad)
    PREPARE_LOAD_ALIGNMENT(1, pRef)
    SAD_ROW(Sad, pRef + 0*RowPitch, pIn + 0*RowPitch, 1)
    ...
    SAD_ROW(Sad, pRef +15*RowPitch, pIn +15*RowPitch, 1)
    SUM2_U32x4(Sum, Sad)

    #define SAD_ROW(dst, pRef, pIn, index) \
        LOAD_U_U8x16(R1, pRef, index) \
        LOAD_A_U8x16(I, pIn) \
        SAD2_ADD_M_U8x16(dst, R1, I, dst)
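For comparison, the value that the sixteen SAD_ROW invocations accumulate can be written as a scalar reference. This is a minimal sketch of the expected result, assuming `pRef` may start at any byte address while both blocks share the same row pitch; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for the 16x16 block L1 distance: the sum of absolute
 * differences between a reference block and an input block whose rows are
 * RowPitch bytes apart in memory. */
uint32_t block_sad_16x16(const uint8_t *pRef, const uint8_t *pIn,
                         size_t RowPitch)
{
    uint32_t sum = 0;
    size_t row, col;

    for (row = 0; row < 16; row++) {
        const uint8_t *ref = pRef + row * RowPitch;
        const uint8_t *in = pIn + row * RowPitch;
        for (col = 0; col < 16; col++)
            sum += (uint32_t)abs((int)ref[col] - (int)in[col]);
    }
    return sum;
}

/* Usage example: 256 pixels that each differ by 2 give a distance of 512. */
void demo_block_sad(void)
{
    enum { PITCH = 24, ROWS = 16 };
    static uint8_t ref[PITCH * ROWS], in[PITCH * ROWS];
    size_t i;

    for (i = 0; i < sizeof ref; i++) {
        ref[i] = 3;
        in[i] = 5;
    }
    assert(block_sad_16x16(ref, in, PITCH) == 512u);
    /* The reference block may start at an unaligned address. */
    assert(block_sad_16x16(ref + 1, in, PITCH) == 512u);
}
```

A scalar routine of this kind is the sort of baseline against which the speedups of the optimized versions are measured.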


Example Reference Versions

Examples: IDCT, L1-Distance

TriMedia®: Case study
MMX™+SSE: Assembly; Assembly; C + intrinsics
SSE2: Assembly; C++ vector classes; C + intrinsics
AltiVec™: C + intrinsics; C + intrinsics
SSE2 Speedups

Speedup = Time_Scalar / Time_Optimized

[Bar chart: speedups of MMM, MMM-Opt, Ref 1 and Ref 2 for IDCT, L1-dist, L1-dist shortcut, L1-dist interpolate and L1-dist interp. shortcut]
MMX™+SSE Speedups

[Bar chart: speedups of MMM, MMM-Opt, Ref 1 and Ref 2 for IDCT, L1-dist, L1-dist shortcut, L1-dist interpolate and L1-dist interp. shortcut]
AltiVec™ Speedups

[Bar chart: speedups of MMM, MMM-Opt, Ref 1 and Ref 2 for IDCT, L1-dist, L1-dist shortcut, L1-dist interpolate and L1-dist interp. shortcut]
TriMedia® Speedups

[Bar chart, vertical axis from 0.00 to 7.00: speedups of MMM, MMM-Opt and Reference for IDCT, L1-dist, L1-dist shortcut, L1-dist interpolate and L1-dist interp. shortcut]


Conclusions and Future Work

■ MMM = Portable + Optimized
   ■ Diverse architectures
   ■ Complex examples, complex instructions
   ■ Hand-coded performance
      ■ Within 12% of best
■ Solution can be applied to other ISAs
   ■ SIMD & DSP
■ Future Work:
   ■ Address ease of programming issues
   ■ MMC: Multimedia C


Vector SAD: MMC Version

    uint8 *a, *b;
    u8x16 A, B;
    u32x4 C;
    int sad;

    A = *a;
    B = *b;
    C = SAD2(A, B);
    sad = SUM2(C);


Presentation 


Other  Approaches 
Parallelizing compilers can generate some multimedia instructions from scalar code, but not the most complex ones. The problem is
that it is not easy to express these complex parallel instructions in C. One can also write parallel programs explicitly using a
data-parallel language, but this still does not solve the problem of expressing complex parallel instructions. Not even languages
specifically designed for multimedia [1,2] can express complex operations like Sum of Absolute Differences or Multiply-High.

Multimedia code can also be generated from abstract descriptions, as in SPIRAL [4]. This approach is complementary to MMM:
the code generator can experiment with different algorithm designs, and prototype them using MMM. Another possibility is to write
multimedia applications based on optimized libraries that conform to a standardized API, like VSIPL [3]. This is a good approach for
certain classes of applications, but it is not as flexible as MMM.


Experiments 

We implemented an MMM library for four distinct multimedia architectures: AltiVec, MMX+SSE, SSE2, and TriMedia TM1300.
These architectures are very diverse. Their register lengths vary from 32 to 128 bits, and they have distinct instruction sets,
alignment requirements and programming styles.

Then we implemented in MMM three example programs used in video compression: 8x8 Inverse Discrete Cosine Transform (IDCT),
16x16 Sum of Absolute Differences (SAD), and 16x16 SAD with horizontal and vertical interpolation. Through the MMM libraries, the
same programs were automatically converted into optimized code for all four target platforms.
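The third example can be pictured with a scalar sketch. The interpolation rule is not spelled out here, so the code below assumes the rounded 2x2 average commonly used for half-pixel motion estimation in MPEG-2; the function name and the rounding are assumptions of this sketch, not the measured implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* SAD of a 16x16 input block against a reference block interpolated half a
 * pixel horizontally and vertically. Each predicted pixel is the rounded
 * average of a 2x2 reference neighbourhood, so a 17x17 reference area is
 * read. The rounding rule is an assumption of this sketch. */
uint32_t block_sad_16x16_interp_hv(const uint8_t *pRef, const uint8_t *pIn,
                                   size_t RowPitch)
{
    uint32_t sum = 0;
    size_t row, col;

    for (row = 0; row < 16; row++) {
        const uint8_t *top = pRef + row * RowPitch;
        const uint8_t *bottom = top + RowPitch;
        const uint8_t *in = pIn + row * RowPitch;
        for (col = 0; col < 16; col++) {
            int predicted = (top[col] + top[col + 1] +
                             bottom[col] + bottom[col + 1] + 2) >> 2;
            sum += (uint32_t)abs(predicted - (int)in[col]);
        }
    }
    return sum;
}

/* Usage example on a checkerboard reference of 10s and 20s. */
void demo_block_sad_interp(void)
{
    enum { PITCH = 24, ROWS = 17 };
    static uint8_t ref[PITCH * ROWS], in[PITCH * ROWS];
    size_t i;

    /* Every 2x2 neighbourhood holds two 10s and two 20s: average 15. */
    for (i = 0; i < sizeof ref; i++) {
        size_t x = i % PITCH, y = i / PITCH;
        ref[i] = ((x + y) & 1u) ? 20 : 10;
        in[i] = 15;
    }
    assert(block_sad_16x16_interp_hv(ref, in, PITCH) == 0u);

    for (i = 0; i < sizeof in; i++)
        in[i] = 12;
    assert(block_sad_16x16_interp_hv(ref, in, PITCH) == 16u * 16u * 3u);
}
```

In a vectorized version the averaging step would map naturally onto a partitioned average operation, followed by the same SAD accumulation as in the plain 16x16 case.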

We measured the execution time and instruction count of the MMM programs on all four targets, and compared them with equivalent
programs hand-optimized by each processor vendor. In some cases our programs outperformed the vendor examples, so we
attempted to further optimize our programs for each target using non-portable instructions, to serve as references. We compared all
the optimized programs with reference scalar implementations, and computed the speedup and the reduction in instruction counts. The
following charts show the speedup of the portable MMM versions compared to the best known optimized versions for each target:


The  portable  MMM  programs  obtained  speedups  that  are  comparable  to  the  best  known  hand-optimized  versions  for  each  target.  In 
some  cases,  they  provide  the  best  known  performance.  The  portable  programs  written  in  MMM  are  indeed  optimized  for  several 
architectures  at  the  same  time. 

Conclusions 

It is possible to write portable, optimized multimedia programs using MMM. These programs are portable among a diverse group of
architectures that have different register lengths, instruction sets, alignment requirements and programming styles. The performance of
portable MMM programs is comparable to the best known implementations for each target.